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Defining the extent of epistasis - the non-independence of the ef¬ 
fects of mutations - is essential for understanding the relationship of 
genotype, phenotype, and fitness in biological systems. The applica¬ 
tions cover many areas of biological research, including biochemistry, 
genomics, protein and systems engineering, medicine, and evolution¬ 
ary biology. However, the quantitative definitions of epistasis vary 
among fields, and its analysis beyond just pairwise effects remains 
obscure in general. Here, we show that different definitions of epis¬ 
tasis are versions of a single mathematical formalism - the weighted 
Walsh-Hadamard transform. We discuss that one of the definitions, 
the backgound-averaged epistasis, is the most informative when the 
goal is to uncover the general epistatic structure of a biological sys¬ 
tem, a description that can be rather different from the local epistatic 
structure of specific model systems. Key issues are the choice of ef¬ 
fective ensembles for averaging and to practically contend with the 
vast combinatorial complexity of mutations. In this regard, we dis¬ 
cuss possible approaches for optimally learning the epistatic structure 
of biological systems. 

There has been much recent interest in the prevalence of 
epistasis in the relationships between genotype, phenotype, 
and fitness in biological systems QH2- Epistasis here is defined 
as the non-independence (or context-dependence) of the effect 
of a mutation, which is a generalization of Bateson’s original 
definition of epistasis as a genetic interaction in which a mu¬ 
tation ’masks’ the effect of variation at another locus [§]. It 
is also in line with Fisher’s broader definition of ’epistacy’ [9]. 
Epistasis limits our ability to predict the function of a system 
that harbors several mutations given knowledge of the effects 
of those mutations taken independently 110 , and makes 
these relationships increasingly more complex (141419] , From 
an evolutionary perspective, the presence of epistatic inter¬ 
actions may limit or entirely preclude trajectories of single¬ 
mutation steps towards peaks in the fitness landscape I20H29I . 
With regard to human health, epistasis complicates our un¬ 
derstanding of the origin and progression of disease I3UH37I . 
Thus, interest in the extent of epistatic interactions in bio¬ 
logical systems has originated from the fields of protein bio¬ 
chemistry, protein engineering, medicine, systems biology, and 
evolutionary biology alike. 

Originally epistasis was considered in the context of two 
genes, but we can define it more broadly as the non¬ 
independence of mutational effects in the genome, whether 
the effects are within, between, or even outside protein coding 
regions (e.g. in regulatory regions). The perturbations may 
go beyond point mutagenesis, but we limit the discussion here 
for clarity of presentation. Importantly, the definition of epis¬ 
tasis can be extended beyond pairwise effects to comprise a 
hierarchy of 3-way, 4-way, and higher-order terms that repre¬ 
sent the complete theoretical description of epistasis between 
the parts that make up a biological system. 

How can we quantitatively assign an epistatic interaction 
given experimentally determined effects of mutations? Since 
epistasis is deviation from independence, it is crucial to Erst 
explicitly state the null hypothesis: asserting what exactly it 
means to have independent contributions of mutations. This 
by itself can be non-trivial. In some cases the phenotype is 
directly related to a thermodynamic state variable, and the 
issue is then straightforward: independence implies additivity 
in the state variable. For example, for equilibrium binding re¬ 
actions between two proteins, independence means additivity 
in the free energy of binding AGbind, such that the energetic 
effect of a double mutation is the sum of the energetic ef- 



Fig. 1. Representation of (A) single mutant, (B) double mutant, and (C) triple 
mutant experiments. Phenotypes are denoted by y g , where g is the underlying geno¬ 
type. g = {piv, with 9i £ {0, 1}: '0' or 4’ indicates the state of the 

mutable site (e.g., amino acid position). The effect of a single, double, triple muta¬ 
tion is given by the red arrows. Pairwise (or second-order) epistasis is defined as the 
differential effect of a mutation depending on the background in which it occurs, for 
example in (B) it is the degree to which the effect of one mutation (e.g. y 10 — t/ 00 ) 
deviates in the background of the second mutation (t/n — yoi)- Thus, the expression 
for second order epistasis is [y 11 — J/ 10 ) — (yoi — t/oo)- The third order and higher 
cases are considered in the main text, 

fects of each single mutation taken independently. However, 
in general, many phenotypes cannot be so directly linked to a 
thermodynamic state variable, and quantification of epistasis 
needs to be accompanied by a proper rationale for the choice 
of null hypothesis. In what follows we will assume this step 
has already been carried out and we will equate independence 
with additivity of mutational effects. Epistasis between two 
mutations is then defined as the degree to which the effect of 
both mutations together differs from the sum of the effects of 
the single mutations. 

In this paper, we describe three theoretical frameworks that 
have been proposed for characterizing the epistasis between 
components of biological systems; these frameworks originate 
in different fields and use seemingly different calculations to 
describe the non-independence of mutations (14II24II33II38H46] . 
We show that these formalisms are different manifestations 
of a common mathematical principle, a finding that explains 
their conceptual similarities and distinctions. Each of these 
formalisms has its value depending on depth of coverage and 
nature of sampling in the experimental data, and the purpose 
of the analysis. In the end, the fundamental issue is to de¬ 
velop practical approaches for optimally learning the epistatic 
structure of biological systems in the face of explosive combi¬ 
natorial complexity of possible epistatic interactions between 
mutations. Demonstrating the mathematical relationships be¬ 
tween the different frameworks for analyzing epistasis is a first 
key step in this process. 


Results 

Basic definitions We begin with a formal definition of geno¬ 
type, phenotype, and the representation of mutational effects. 
Consider a specific sequence comprised of N positions as a bi¬ 
nary string g = {pjv,..., gi} with gi £ {0,1}, where ’0’ and T’ 
represent the ”wild-type” and mutant state of each position, 
respectively. This defines a total space of 2 N genotypes. The 
analysis could be expanded to the case of multiple substitu¬ 
tions per position, but we consider just the binary case for 
clarity here. Each genotype g has an associated phenotype 
yg , which is of the form that the independent action of two 
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mutations means additivity in y. For notational simplicity, we 
will simply write the genotype in a fc-bit binary form, where 
k is the order of the mutations that are considered. For ex¬ 
ample, the effect of a single mutation is simply yi — yo, the 
difference in the phenotype between the mutant and wild-type 
states (Fig. [TJ4). The effect of a double mutant is given by 
yn — Poo (red arrow, Fig. HB), an d its linkage through paths 
of single mutations is defined by a two-dimensional graph (a 
square network) with four total genotypes. Similarly, a triple 
mutant effect is y m — yooo (red arrow, Fig. [Tp), and its link¬ 
age through paths of single mutations are enumerated on a 
three-dimensional graph (a cube) with eight total genotypes. 
More generally, and as described by Horowitz and Fersht [47] , 
the phenotypic effect of any arbitrary n-dimensional mutation 
can be represented by an n-dimensional graph, with 2 n total 
genotypes. Understanding the relationship of the phenotypes 
of multiple mutants to that of the underlying lower-order mu¬ 
tant states is the essence of epistasis, and is described below. 

The biochemical view of epistasis A well-known approach in 
biochemistry for analyzing the cooperativity of amino acids in 
specifying protein structure and function is to use the formal¬ 
ism of thermodynamic mutant-cycles mmm, one manifes¬ 
tation of the general principle of epistasis. In this approach, 
the ’’phenotype” is typically an equilibrium free energy AG 
(e.g. of thermodynamic stability or biochemical activity), and 
the goal is to obtain information about the structural ba¬ 
sis of this phenotype through mutations that represent sub¬ 
tle perturbations of the wild-type state. For pairs of mu¬ 
tations, the analysis involves measurements of four variants: 
wild-type (yoo = A Go), each single mutant (yoi = AG) 3 and 
yio = AG 2 ), and the double mutant (j/n = AG] 3 ^), where 
the subscripts designate the mutated positions, and the su¬ 
perscript ’o’ indicates free energy relative to a standard state 
(Fig. IB). 

From this, we can compute a coupling free energy between 
the two mutations (A 2 Gi, 2 ) as the degree to which the effect of 

one mutation (A Gi) is different when tried in the background 
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of the other mutation (AG 112 ): 

A 2 Gi, 2 = AGi| 2 - AGi 

= (AGi° 2 - AG 2 °) - (AG° - AG 0 °) [ 1 ] 

o 1 

Whereas the AG terms are individual measurements and A G 

terms are the effects of single mutations relative to wild-type, 
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A "G is a second order epistatic term describing the coopera¬ 
tivity (or non-independence) of two mutations with respect to 
the wild-type state. This analysis can be expanded to higher 

order. For example, the third order epistatic term describing 
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the cooperative action of three mutations 1, 2, and 3 (A Gi, 2 , 3 ) 
is defined as the degree to which the second order epistasis of 
any two mutations is different in the background of the third 
mutation: 

A Gi, 2,3 = A Gl,2 (3 — A Gl,2 

3 3 

= AGi : 2,3 ~ AG,j + AGi — AGo [ 2 ] 

i<j i 

3 

Note that A G requires measurement of eight individual geno- 
types (Fig. 1C). More generally, we can define an n-th order 
epistatic term (A G), describing the cooperativity of n muta¬ 
tions, 

n 

AGi n = A G 1 „ + AG il|i2i ... iin _ 1 

il<i2<--<in-l 

n 

+ ( -1 ) 2 ^^ * 2 ,"-4*1-2 +■•■ + (-1)™AG 0 [3] 

*l<*2<."<«n —2 

It is possible to write this expansion in a compact matrix form: 

7 = Gy [4] 


where 7 is the vector of 2 n epistasis terms of all orders, and y 
is the vector of 2 n free energies corresponding to phenotypes of 
all the individual variants listed in binary order. To illustrate, 
for three mutations n = 3, and we obtain 
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In this representation, subscripts in y represent combinations 
of mutations (e.g. you = A G°^, a double mutant) and sub- 
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scripts in 7 represent epistatic order (e.g. 7011 = A Gi, 2 , pair¬ 
wise epistasis between mutations 1 and 2). Thus, equations [l] 
and [ 2 ] correspond to multiplying y by the fourth or eighth row 
of G, respectively, to specify 7011 and 7 m. Note that y and 7 
contain precisely the same information, re-written in a differ¬ 
ent form. The matrix G represents an operator linking these 
two representations of the mutation data and we will return 
to the nature of the operation in a later section. We can write 
a recursive definition for G that defines the mapping between 
y and 7 for all epistatic orders n: 

G n+1 = (_g g) with Go = 1 [S] 

The inverse mapping is defined by y = G~ 1 ' y. This relation¬ 
ship gives the effect of any combination of mutants (in y) as 
a sum over epistatic terms (in 7 ). For example, the energetic 
effect of three mutations 1,2, and 3 (AGi 2 ,3 = ym) is: 

3 3 

AG 12,3 = A Gi, 2,3 + ^ ] A Gij + ^ 3 A Gi + AGq [6] 

i<j i 

Thus, in the most general case, the free energy value of a mul¬ 
tiple mutation requires knowledge of the effect of the single 
mutations and all associated epistatic terms. For the triple 
mutant, this means the wild-type phenotype, the three sin¬ 
gle mutant effects, the three two-way epistatic interactions, 
and the single three-way epistatic term. This analysis high¬ 
lights two important properties of epistasis: ( 1 ) the lack of any 
epistatic interactions between mutations dramatically simpli¬ 
fies the description of multiple mutations to just the sum over 

the underlying single mutation effects, and ( 2 ) the absence of 
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lower-order epistatic interactions (e.g. A Gij = 0) does not 
imply absence of higher order epistatic terms. 

The ensemble view of epistasis In contrast to the biochemi¬ 
cal definition, the significance of a mutation (and its epistatic 
interactions) may also be defined not solely with regard to a 
single reference state as the ’’wild-type”, but as an average 
over many possible genotypes. As we show below, such aver¬ 
aging better represents the epistatic level at which mutations 
operate, and in principle, can separate mutant effects that are 
idiosyncratic to particular genotypes from those that are fun¬ 
damentally important. The concept of averaging epistasis over 
genotypic backgrounds is analogous to the idea of the ’schema 
average fitness’ in the field of genetic algorithms (GA) [501151] . 
which was recently introduced in biology [45] . 

In its complete form, background-averaged epistasis con¬ 
siders averages over all possible genotypes for the remaining 
positions in the ensemble. For example, if n = 3, the epista¬ 
sis between two positions 1 and 2 is computed as an average 
over both states of the third position (e*n, with the averaging 
denoted by V) (see. Fig. 1C): 

e * n = 2 { K ^ 111 “ 2/11 °) “ (?h 01 ~ yioo)] 

+ [(yon - yoio) - (yooi - yooo)] > [7] 
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Thus for n = 3, we can write all epistatic terms: 
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where V is a diagonal weighting matrix to account for averag¬ 
ing over different number of terms as a function of the order 
of epistasis; va = ( — l) qi /2 n ~ qi , where qt is the order of the 
epistatic contribution in row i. More generally, for any number 
of mutations n: 

e = VHy. [ 8 ] 


where y is the same vector of phenotypes of variants as defined 
above, e is the vector of background averaged epistatic terms, 
and H is the operator for background-averaged epistasis, de¬ 
fined recursively as 


following the same rule for subscripts as before. X has the 
recursive definition: 

Xn+1 = (x" x ) with Xo = 1 I 13 ] 

It is worth noting that the inverse of X is X -1 = G, the 
operator for biochemical epistasis (Eq. [5] see Supplementary 
Information). Thus, the multi-dimensional mutant-cycle anal¬ 
ysis is indistinguishable from regression to full order - the case 
in which r = n and e = 0 . 

However, the usual aim of regression is to approximate the 
data with fewer coefficients than there are data points, i.e., 
r < n. To express this, we simply remove the columns from 
X that refer to the epistatic orders excluded from the regres¬ 
sion (i.e., > r): X is multiplied by an 2 n -bj-m matrix Q, the 
identity matrix with columns corresponding to epistatic orders 
higher than r removed, m is the number of epistatic terms up 
to r and is given by m = ( 1 ) • Thus for regression to 

order r, we can define X = XQ, and write 


H n+ 1 = ^ ^ -ii") with H ° ~ 1 [ 9 1 

The recursive definition for the weighting matrix V is 

V n+ i = ^ 2 ^ n _y ) with ^o = l [10] 


y = X(3 + e. [14] 

The linear regression is performed by solving the so-called nor¬ 
mal equations 

$=(X T X)~ 1 X T y [15] 


The matrix H has special significance; its action mathemat¬ 
ically corresponds to a generalized Fourier analysis [55] known 
as the Walsh-Hadamard transform. This converts the pheno¬ 
types of individual variants (in y) into a vector of averaged 
epistasis (in e), an operation that can also be seen as a spectral 
analysis of the high-dimensional phenotypic landscape defined 
by the genotypes studied. In this transform, the phenotypic 
effects of combinations of mutations are represented as sums 
over averaged epistatic terms. 

In summary, the definition of epistasis proposed in evolu¬ 
tionary genetics is a global definition over sequence space, av¬ 
eraging the epistatic effects of mutations over the ensemble of 
all possible variants. In contrast, the biochemical definition 
given in the previous section is a local one, treating a partic¬ 
ular variant as a reference for determining the epistatic effect 
of mutations. 


~ T ~ 

where X X is necessarily square and invertible as long as X 

~ T ~ 

is full column rank and hence X X is full rank. Note that 
in this analysis we compute epistatic terms only up to the 
r th order, but use phenotype/fitness data of all 2 " combina¬ 
tions of mutants. The more general case in which we estimate 
epistatic terms with less than 2 n data points is distinct and is 
discussed below. 

If the biochemical definition of epistasis is a local one, ex¬ 
ploring the coupling of mutations of all order with regard to 
one ’’wild-type” reference, and the ensemble view of epistasis 
is a global one, assessing the coupling of mutations of all order 
averaged over all possible genotypes, then the regression view 
of epistasis is an attempt to project to a lower dimension - 
capturing epistasis as much as possible with low-order terms. 


Estimating epistasis with linear regression A third approach 
for analyzing epistasis is linear regression. For example, when 
we have a complete dataset of phenotypes of all 2 n genotypes, 
we can use regression to define the extent to which epistasis 
is captured by only considering terms to some order r < n. 
That is, whether terms up to the r th order are sufficient for 
effectively capturing the full complexity of a biological system. 
The standard form for a linear regression is a set of equations: 

n n n 

y g — /3o+y~^/3»ffi+y^ Pijkgigjgk+---+e g [ll] 

i =1 i<j i<j<k 

for each genotype g. The /? terms denote the regression co¬ 
efficients corresponding to the (epistatic) effects between sub¬ 
scripted positions, and e g is the residual noise term. In matrix 
form this can be written as 

y = Xf3 + e. [12] 

where X tabulates which regression coefficients are summed 
over for genotypes g. For n = 3, regressing to full order, we 
can write 
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Link between the formalisms The analysis presented above 
leads to a simple unifying concept underlying the calculations 
of epistasis. In general, all the calculations are a mapping 
from the space of phenotypic measurements of genotypes y to 
epistatic coefficients u>, in a general form u> = fl ep i y, where 
fl ep i is the epistasis operator. We give the bottom line of the 
different operators below; their formal mathematical deriva¬ 
tions can be found in the Supplementary Information. 

The most general situation is that of the background- 
averaged epistasis with averaging over the complete space of 
possible genotypes. In this case 

ttepi = VH, [16] 

where H is a 2 n x 2 n matrix corresponding to the Walsh- 
Hadamard transform (n is the number of mutated sites) and 
V is a matrix of weights to normalize for the different num¬ 
bers of terms for epistasis of different orders. The biochemical 
definition of epistasis using one ’’wild-type” sequence as a ref¬ 
erence is a sub-sampling of terms in the Hadamard transform. 
In this case 

Cl epi = VX T H, [17] 

where X is, as defined in Eq. [13] In essence, X T picks out 
the terms in H that concern the wild-type background. Note 
that both these mapping are one-to-one, such that the number 
of epistatic terms (in u>) is equal to the number of phenotypic 
measurements (in y) and no information is lost. In contrast, 
regression to lower orders necessarily implies fewer epistatic 
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Fig. 2. Example of three-way epistasis in the affinity of a PDZ binding domain 
for its ligand. A) In blue the PSD95-PDZ3 domain, and in orange its ligand peptide 
positioned in the binding pocket. The positions under consideration are shown as 
space-filling spheres. B) Measured Kd values in /iM for all eight combinations of two 
amino acids at the three mutable positions. 


terms than data points, which means the mapping is com¬ 
pressive and information is lost. In this case 

£l epi = VX T SH, [18] 

where S (= QQ T ) is the identity matrix but with zeros on the 
diagonal at the orders that are higher than which we regress 
over. 

The fundamental point is that all three formalisms for 
computing epistasis are just versions of the Walsh-Hadamard 
transform, with terms selected as appropriate for the choice 
of a single reference sequence or limitations on the order 
of epistatic terms considered. From a computational point 
of view, it is interesting to note that regression using the 
Hadamard transform makes matrix inversion unnecessary 
(compare with Eq. 1151) . 

An empirical example: a cooperative mechanism in a PDZ 
domain To illustrate the different analyses of epistasis, we 
consider a small case study of three spatially proximal muta¬ 
tions that define a switch in ligand specificity in PSD95-PDZ3, 
a member of the PDZ family of protein interaction modules 
(Fig. [5]\). Two mutations are in PSD95-PDZ3 (G330T and 
H3T2A), and one mutation in its cognate ligand peptide (T- 
2F). The phenotype is the binding affinity, 7Q, and the ab¬ 
sence of epistasis implies additivity in the corresponding free 
energy, expressed as AG° = RTlnKd in kcal mol -1 . Binding 
affinities for this system are from ref. m, and given in Figure 
Up. These quantitative phenotypes are then transformed to 
epistatic terms using Ea. 1161181 ITable 1). 

A number of simple mathematical relationships are evident 
in the data. First, regression is carried out only to the second- 
order and therefore the third-order epistatic term for this 


Table 1. Interaction terms after applying the three different transforms to 
the PDZ-ligand dataset with three mutable positions: three-way mutant-cycle, 
background-averaged epistasis, and regression (to second order). 


analysis does not exist (or, equivalently, is set to zero if the 
epistatic vector /3 is defined to be of full length 2 n ). Second, 
there are some equalities. The regression terms at the high¬ 
est order (second, in this case) are equal to the correspond¬ 
ing terms for the averaged epistasis. This is because X T S 
sets columns corresponding to orders higher than the regres¬ 
sion order to zero, leaving rows corresponding to the highest 
regression order with only one non-zero element, on the di¬ 
agonal. For these rows the entries in the epistasis operators 
VX T SH and VH are equal. Another more trivial equality is 
the highest-order term for the mutant-cycle and averaged epis¬ 
tasis formalisms; there is only one contribution for the highest 
order, and therefore no backgrounds to average over. 

The data also illustrate the key properties of the different 
formalisms. The G330T, H372A, and T-2F mutations repre¬ 
sent a collectively cooperative set of perturbations, as indi¬ 
cated by a significant third-order epistatic term by both mu¬ 
tant cycle and background averaged definitions (7111 = £111 = 
1.67 kcal mol -1 ). But the three formalisms differ in the en¬ 
ergetic value of the lower order epistatic terms. For example, 
G330T is essentially neutral for wild-type ligand binding but 
shows a dramatic gain in affinity in the context of the T-2F 
ligand; thus, a large second-order epistatic term by the bio¬ 
chemical definition (7101 = —2.33 kcal mol -1 ). However, the 
coupling between G330T and T-2F is nearly negligible in the 
background of H372A; as a consequence, the background av¬ 
eraged second-order epistasis term £ 1*1 is smaller (—1.5 kcal 
mol -1 ). Similarly, both biochemical and regression formalisms 
assign a large first-order effect to the T-2F (1**) and H372A 
(* 1 *) single mutations, while the corresponding background- 
averaged terms are nearly insignificant. For example, the free 
energy effect of mutating H372A ( 7010 ) is 2.05 kcal mol -1 in 
the wild-type background, but is —1.71 kcal mol -1 in the back¬ 
ground of the T-2F ligand mutation - a nearly complete rever¬ 
sal of the effect of this mutation depending on context. Thus 
with background averaging, the first order term for H372A 
(£»i*) is close to zero. This makes sense; given the experi¬ 
ment described in Figure 2, the H372A mutation should not 
be thought of as a general determinant of ligand affinity. In¬ 
stead it is a conditional determinant, with an effect that de¬ 
pends on the identity of the amino acid at the —2 position of 
the ligand. Note that the degree of averaging depends on the 
number of mutated sites, and thus the interpretation of mu¬ 
tational effects will depend on the scale of the experimental 
study. 

These examples show that background averaging has the 
effect of ’’correcting” mutational effects for the existence of 
higher-order epistatic interactions. Without background aver¬ 
aging, the effect of a mutation (at any order) idiosyncratically 
depends on a particular reference genotype and will fail to 
account for higher order epistasis which modulates the ob¬ 
served mutational effect. Thus, background averaging pro¬ 
vides a measure of the effects of mutation that represents its 
general value over many related systems, and more appropri¬ 
ately represents the cooperative unit within which the muta¬ 
tion operates. 
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*The three mutable positions in genotypes are T-2F in the ligand, and H372A and 
G330T in the protein, respectively. They are designated in this column as THG’. 
f Measured free energies in kcal/mol, expressed as RTlnKd, at T — 293K 
1 Interacting positions are in the same order as genotypes, for example ’*11' indicates 
the epistasis between amino acid positions 372 and 330 in PSD95-PDZ3. 


The epistatic structure of real systems The analytical expres¬ 
sions in Ea. ll(H18l involves the measurement of phenotypes (y) 
for all 2 n combinatorial mutants, a fact that exposes two fun¬ 
damental problems. First, it is only practical when n is small. 
In such cases (e.g Figure 2, n = 3), the data can be combi- 
natorially complete permitting a full analysis - the local and 
global structure of epistasis, possible evolutionary trajecto¬ 
ries, and adaptive trade-offs [54] . But for the typical size of 
protein domains (n ~ 150), the combinatorial complexity of 
mutations precludes the collection of complete datasets. Sec¬ 
ond, even if it were possible, the sampling of all genotypes is 
not desired; indeed, the majority of systems in such an ensem¬ 
ble are unlikely to be functional and and averages over them 
are not meaningful with regard to learning the epistatic struc¬ 
ture of native systems. How then can we apply these epistasis 
formalisms in practice, especially with regard to background 
averaging? 
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Fig. 3. Examples of matrices Z p introduced to calculate the partial background- 
averaged epistasis, for n = 3. (A) Z 2 for when data for mutants up to second-order 
is available and (B) Z x for when only first-order mutants are available. Both matrices 
are self-similar, which allows their generation for arbitrary order, and are related to 
the so-called logical Sierpiski triangle. For example Z 2 =1 — AS, where A is the 
anti-diagonal identity matrix and S is the Sierpinski matrix (i.e. multigrade AND in 
Boolean logic) for three inputs. 


To develop general principles, we begin with two obvious ap¬ 
proaches that lead to well-defined alternative expressions for 
averaged epistasis. First, consider the case in which the data 
are only ’’locally complete”; that is, we have all possible mu¬ 
tants up to a certain order p < n. We can then define a mea¬ 
sure that is intermediate between epistasis with a single ref¬ 
erence genotype and epistasis with full background-averaging, 
which we will refer to as the partial background-averaged epis¬ 
tasis. For example, for three positions (n = 3) with data 
complete only up to order (p = 2), the partial background- 
averaged effect of the first position (rightmost subscript), is 
calculated as e**i, P = (yooi — ?/ooo+ 2/011 — yoio+ 2/101 —yioo)/3. 
Compared to the full background-averaged epistasis, the par¬ 
tial averages just leaves out the last term, y m — yuo, which 
represents the unavailable phenotype of the triple mutant j/m. 
More generally, we can define this measure of epistasis as an¬ 
other special case of the Hadamard transform: 


e p = W p (Z p oH)y , [19] 


where o designates the element-wise product. W p is again a 
diagonal weighting vector, now given by va = (—1 ) qi /T Piqi 
where qi is the epistatic order associated with row i as defined 
earlier, and T Pi9i = 5 ^= 0 * Note that p > qt because 

mutants of order higher than p are considered absent in the 
dataset. 

The matrix Z p simply serves to multiply by zero the terms 
in the Hadamard matrix that include orders higher than p. In¬ 
terestingly, the Z matrices display a self-similar hierarchical 
pattern (Fig. QTI) and are related to so-called Sierpinski trian¬ 
gles (see ref [55]). This permits a recursive definition in both 
n and p for the product Z p o H, which we will designate as 
F : 

n,p 


F, 


,p 



i 



-i,p— 

-i,p— 


[ 20 ] 


with F np = H p for n < p, and F n 0 is a 2" x 2 n matrix of 
zeros, except for a 1 in the upper left corner. This analysis 
assumes that data are complete up to the order p. If not, 
analytical schemes for background-averaged epistasis such as 
Eas ll9ll20l are not obvious. 

A second analytically tractable case for incomplete data 
arises in regression, where the idea is to estimate epistatic 
terms up to a specified order from available data. This involves 
solving a set of equations similar to the normal equations: 

j3 = Q(X T X^) 1 X T My [ 21 ] 

where M is an s x 2 n matrix constructed from the 2 n by 2 n 
identity matrix by deleting the 2™ — s rows corresponding to 
the unavailable phenotypic data, and X = MXQ, with Q 
defined as above. In order for this system of equations to 
be solvable, a necessary constraint is that s > m; that is, 
the number of data points available should be larger than or 
equal to the number of regression parameters. In addition, the 
data must be such that it is possible to uniquely solve for all 


epistatic terms in the regression. For example, if two muta¬ 
tions always co-occur in the data, it is obviously impossible to 
calculate their independent effects. In such cases, the number 

~ 'J 1 ~ 

of solutions to Eq. [21] is infinite (X X is not invertible). 

In practice, even with ’’high-throughput” assays, we can 
only hope to measure a tiny fraction of all combinatorial mu¬ 
tants due to the vast number of possibilities. In this situation, 
the problem of inferring epistasis by regression may be fur¬ 
ther constrained by imposing additional conditions, termed 
regularization. For example, kernel ridge regression [56] and 
LASSO [57] include a weighted norm of the regression coef¬ 
ficients in the minimization procedure. Regularization comes 
with its own set of caveats [58) . but its application is, unlike 
the approaches in Eq. [T9] and 1211 not conditional on specific 
structure of the data or depth of coverage. 

However, none of these approaches directly addresses the 
problem of optimally defining appropriate ensembles of geno¬ 
types over which averages should be taken. In principle, the 
idea should be to perform background averaging over a rep¬ 
resentative ensemble of systems that show invariance of func¬ 
tional properties of interest. How can we generally find such 
ensembles without the impractical notion of exhaustive func¬ 
tional analysis of the space of possible genotypes? One idea 
is motivated by the empirical finding of sparsity in the pat¬ 
tern of key epistatic interactions within biological systems. 
Indeed, evidence suggests that in proteins, the architecture is 
to have a small subset of amino acids that shows strong and 
distributed epistatic couplings surrounded by a majority of 
amino acids that are more weakly and locally coupled [59H63] . 
More generally, the notion of a sparse core of strong couplings 
surrounded by a milieu of weak couplings has been argued to 
be a signature of evolvable systems [64] , If it can be more 
generally verified, the notion of sparsity might be exploited to 
define relevant strategies for optimally learning the epistatic 
structure of natural systems. One approach is to minimize the 
so-called ^i-norm (the sum of absolute values of the epistatic 
coefficients) in a constrained optimization, which has the ef¬ 
fect of producing many epistatic coefficients with zero or very 
small values m, while projecting onto background-averaged 
epistatic terms: 

min||e||i subject to y = H~ 1 V~ 1 e [22] 

£ 

This procedure is akin to the technique of compressed sens¬ 
ing [BS], a powerful approach used in signal processing to rec¬ 
ognize the low-dimensional space in which the relevant features 
of a high-dimensional dataset occur given the assumption of 
sparsity of these features. The application of this theory for 
mapping biological epistasis has to our knowledge not been 
reported before, but might be explored with focused high- 
order mutational analyses in specific well-chosen model sys¬ 
tems. The necessary technologies for such experiments are 
now becoming available, and should help define practical data 
collection strategies for studying epistasis more generally. 

It is worth pointing out that other approaches that use 
ensemble-averaged information to understand biological sys¬ 
tems have been developed and experimentally tested. For ex¬ 
ample, statistical methods that operate on multiple sequence 
alignments of proteins [551 [75] calculate quantities related to 
epistasis that are averaged over the space of homologous se¬ 
quences. Importantly, these approaches have been success¬ 
ful at revealing a hierarchy of cooperative interactions be¬ 
tween amino acids that range from local structural contacts 
in protein tertiary structures [24)175] to more global functional 
modes [55] • For defining good experimental approaches to 
epistasis, a conceptual advance may come from an attempt to 
formally map the constrained optimization problem described 
in Eq.[22]to the kind of ensemble averaging that underlies the 
statistical coevolution approaches. 


Discussion 

A fundamental problem is to define the epistatic structure 
of biological systems, which holds the key to understanding 
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how phenotype arises from genotype. Here we provide a uni¬ 
fied mathematical foundation for epistasis in which different 
approaches are found to be versions of a single mathemati¬ 
cal formalism - the weighted Walsh-Hadamard transform. In 
the most general case, this transform corresponds to an aver¬ 
aging of mutant effects over all possible genetic backgrounds 
at every order order of epistasis. This approach corrects the 
effect of mutations at every level of epistasis for higher or¬ 
der terms. Importantly, it represents the degree to which the 
effects of mutations are transferable from one model system 
to another, the usual purpose of most mutagenesis studies. 
In contrast, the thermodynamic mutant cycle [J7] (commonly 
used in biochemistry) constitutes a special case of taking a sin¬ 
gle reference genotype and thus no averaging I59II66H7T] . This 
analysis represents the effects of mutations that are specific 
to a particular model system. Regression (commonly used in 
evolutionary biology) is an attempt to capture features of a 
system with epistatic terms up to a defined lower order, often 
to bound the extent of epistasis or to predict the effects of 
higher-order combinations of mutations [72]. The similarity 
of the regression operator to that of the mutant cycle (see Eq. 
m indicates that this approach is also focused around the 
local mutational environment of a chosen reference sequence. 
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In general, background averaging would seem to provide the 
most informative representation of the effect of a mutation. 
However, with the exception of very small-scale studies fo¬ 
cused in the local mutational environment of extant systems, 
it is both impractical and logically flawed to collect combi- 
natorially complete mutation datasets for any system. Thus, 
the essence of the problem is to define optimal strategies for 
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discovering the biologically relevant epistatic structure of sys¬ 
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The notion of sparsity in epistasis provides a general ba¬ 
sis for developing such a strategy, and it will be interesting 
to test practical applications of this concept (e.g. Ea. l22H in 
future work. Defining optimal data collection strategies will 
not only provide practical tools to probe specific systems, but 
might guide us to principles underlying the ’’design” of these 
systems through the process of evolution, and help the rational 
design of new systems. The mathematical relations discussed 
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Supplementary Information: Proofs and extended methods 


A. Expressing the biochemical epistasis operator G as a Hadamard transform: 
G = X 1 = VX t H (Eq. HU) 


First we write the different matrix operators in their recursive form, and then proceed by induction. We have for the re¬ 
cursive form of X: 


X n +i — 


x n 0 
x n x„ 


with Xq = 1 


In order to find the generative function for the inverse X we can write X n +iX n+1 = I: 


( Xn 

0 \ -■ / 

' 11 

o \ 

V Xn 

Xn J Xn + 1 " | 

v 0 

I 


which we can solve by Gauss-Jordan elimination: 


Xn 0 

Xn X r , 

hence we have for the inverse of X : 


I 

O 

i=i 

0 

-1 

X n 

0 

0 

i=i 

i=i 

1=1 

0 

X~n 


X n +i 


V 

-X 


0 

x~ 


with X 0 = 1 


I 0 
0 I 


X 

X 


X 


Which is identical to the recursive form for G: 
Gn + l = 


Gn 

—G„ 


0 

Gn 


We further have: 

H 1 


// 

H n 


Hn 

Hn 


' n+l = 


bVn 

0 


o 

-V n 


with H o = 1 


with Vo = 1 


With the above relations we can derive the equality in the main text expressing G as a Hadamard transform: 
G„ = X' 1 = V n X^H n 


For n = 0 the statement is trivial. We now show by induction that this relation holds for all n. 



( V n XlHn 0 \ 

V - V n XlHn V n X T n Hn ) 

2 X T n Hn 0 \ _ / \V n 

x T n H n -x T n H n y V o 


QED 


0 

~Vn 


xl X T n 

0 Xl 


Hn Hn \ 

H n -Hn ) 


B. Expressing the regression operator as a Hadamard transform: 
q(x T x) 1 X T = VX T SH (Eq. [18} 

We will use X — XQ and S = QQ T as defined in the main text. 
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For the right-hand side we can write 

VX t SH = ±VX T (HH) SH 

where we used H 2 n = 2 n I„, which can be proven straightforwardly by induction using the generative function for H. 

-1 rri 

Rearranging and using X =VX H, we obtain 
VX t SH = (. HSH) 

We thus have to prove 

Q (X T X) "V = il- 1 (HSH) 

~ T 

Left-multiplying both sides by X X (mind the hat is only on the first operator) and right-multiplying by H we are left 
to prove 

~ T ~ T 

X H = X HS 

Left-multiplication by Q yields 

sx'h = sx t hs 

which, again using the relation we proved in section A above, can be rewritten as 

sv-'x - 1 = sv-'x-'s 

or 

SX- 1 = SX- ] S 

given the commutative properties of diagonal matrices S and V _1 . 

This equality indicates that setting certain rows of X to zero (left-hand side) is the same as setting both those rows 
and corresponding columns of X^ 1 to zero (right-hand side). This is obviously not true for every set of rows and columns, and 
needs more discussion. 

We can prove this iteratively starting at regression to order n — 1 and going down to lower order. If regression is done to 
order n — 1 , this means that only the last row of X^ 1 is set to zero, and by construction of X^ 1 (see above) the last column 
only has a non-zero element in this row. This means that in this case the equality is correct. Another way to see this is 
looking at matrix G for n = 3 in its explicit representation in the main text (here G being identified with X^ 1 ) and noting 
that the highest order epistatic term 7 m is the only one that receives a contribution from the highest order (n) mutant term 1 / 111 . 

Next, if regression is performed instead to order n — 2, not only the last row of X is set to zero, but also the rows 
corresponding to n — 1 order mutants. Analogously to above, the only terms in the vector 7 that receive contributions from 
the n — 1 order mutants are the ones in the rows corresponding to n — 1 order of epistasis (since the row corresponding to n th 
order is already set to zero), meaning that their corresponding column again has only one non-zero element. Hence setting 
these rows to zero will directly set their corresponding column to zero, and the equality holds. 

And so forth for regression to order n — 3, etc., etc. 

QED 
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